Risk Factor Hunting for Alzheimer's Disease Using Rough Set Theory on
Lead Investigator: Geeta Karadkhele
Institution: Georgia State University
E-Mail: gkaradkhele1@student.gsu.edu
Proposal ID: 551

Proposal Description: We plan to use rough set theory to identify genes affecting disease risk from the data collected by NACC. The search for genetic and environmental (lifestyle) factors that influence common complex traits, and the characterization of the effects of those factors, has attracted extensive research interest. In recent years, the detection of disease-related genes has been revolutionized by the success of genome-wide association (GWA) studies on case-control data. Most of these studies use a single-locus analysis strategy, which focuses only on the susceptibility of individual SNPs. Although this strategy has become routine and has yielded many interesting findings, those findings cannot completely explain the genetic causes of complex diseases, because complex diseases result from interactions between loci. Identifying multiple-locus association patterns of complex diseases has therefore attracted increasing attention.

To expand the applications of rough sets in data mining and knowledge discovery from big data, we proposed a parallel method for computing approximations based on rough sets and MapReduce. We also presented a parallel method for knowledge acquisition using MapReduce. Building on this work, we discuss rough-set-based parallel methods for large-scale knowledge acquisition in GWA studies. The corresponding parallel algorithms are designed for knowledge acquisition on the basis of the characteristics of case-control data. The designed algorithms will be implemented on several representative MapReduce runtime systems, including Hadoop, Phoenix and Twister. We will test the algorithms on these runtime systems and compare their performance.
Previous comprehensive experimental results demonstrate that algorithms based on rough sets and cloud techniques can effectively process very large data sets and are especially suitable for data from large GWA studies.
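
To make the idea of computing rough set approximations in a MapReduce style more concrete, the following is a minimal single-process sketch in Python. It is not the proposal's algorithm and does not use Hadoop, Phoenix or Twister; it only imitates a map step (key each record by its condition attribute values) and a reduce step (group records that share a key into equivalence classes). The toy records, the attribute names `snp1` and `snp2`, the `status` field and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy case-control data: genotypes at two SNPs plus disease status.
records = {
    "p1": {"snp1": "AA", "snp2": "CT", "status": "case"},
    "p2": {"snp1": "AA", "snp2": "CT", "status": "case"},
    "p3": {"snp1": "AG", "snp2": "CC", "status": "case"},
    "p4": {"snp1": "AG", "snp2": "CC", "status": "control"},
    "p5": {"snp1": "GG", "snp2": "CC", "status": "control"},
}


def map_phase(records, attrs):
    """Map: emit (condition-attribute values, (record id, decision)) per record."""
    for rec_id, rec in records.items():
        key = tuple(rec[a] for a in attrs)
        yield key, (rec_id, rec["status"])


def reduce_phase(pairs):
    """Reduce: group pairs by key into equivalence classes.

    Each class keeps its member ids and the set of decisions seen in it.
    """
    classes = defaultdict(lambda: (set(), set()))
    for key, (rec_id, decision) in pairs:
        members, decisions = classes[key]
        members.add(rec_id)
        decisions.add(decision)
    return classes


def approximations(records, attrs, target):
    """Return (lower, upper) approximations of the set of records with decision `target`."""
    classes = reduce_phase(map_phase(records, attrs))
    lower, upper = set(), set()
    for members, decisions in classes.values():
        if target in decisions:
            upper |= members          # class overlaps the target set
            if decisions == {target}:
                lower |= members      # class lies entirely inside the target set
    return lower, upper


lower, upper = approximations(records, ["snp1", "snp2"], "case")
print("lower:", sorted(lower))
print("upper:", sorted(upper))
```

In this toy data, records that share a genotype pattern and are all cases fall in the lower approximation, while a pattern shared by both a case and a control contributes only to the upper approximation. In a real MapReduce runtime the grouping by key would be performed by the framework's shuffle stage rather than by an in-memory dictionary.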